Data wrangling III — pivoting
2025-02-17
In this video of the data wrangling topic you will
Converting between a wide and long representations is called pivoting
gather() & spread()An earlier implementation in tidyr 📦 used
gather()spread()roughly corresponding to
pivot_longer()pivot_wider()Many examples online and in 1st edition of r4ds used gather() and spread() but the pivot_*() functions are to be preferred now
pivot_longer()How does pivoting data to a longer format work?
Say we have a data set with three dogs (A, B, and C) and we have taken their blood pressure twice each
If we want to plot or work with the BP measurements, we need these in a single column (variable)
pivot_longer()Our aim is to have a data frame with three variables
dogmeasurementvalueWe achieve this with pivot_longer()
pivot_longer()pivot_longer()The id column is already a variable
It needs to be repeated once per column that we are pivoting
Here we are pivoting two columns, so we repeat each id twice
pivot_longer()The column names from dogs:
bp1bp2become the values in a new column, whose name was give by the names_to argument: "measurement"
The original column names need to be repeated once per row of the original data
pivot_longer()The cell values in the original data also become a new variable
The name of this new variable is given by the values_to argument: "value"
They are unwound, row by row
pivot_longer()dogs |>
pivot_longer(
cols = bp1:bp2, # which columns are we pivoting?
names_to = "measurement", # variable to create in long form
values_to = "value" # variable to hold the cell values
)
## # A tibble: 6 × 3
## id measurement value
## <chr> <chr> <dbl>
## 1 A bp1 100
## 2 A bp2 120
## 3 B bp1 140
## 4 B bp2 115
## 5 C bp1 120
## 6 C bp2 125Sometimes you will have data like this, with multiple variables per column
## # A tibble: 3 × 679
## country h_1996_Jan c_1996_Jan h_1996_Feb c_1996_Feb h_1996_Mar c_1996_Mar
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 England 91024 355 90735 392 90512 440
## 2 Wales 21296 152 21256 164 21195 182
## 3 Scotland 20080 32 20034 31 19989 31
## # ℹ 672 more variables: h_1996_Apr <dbl>, c_1996_Apr <dbl>, h_1996_May <dbl>,
## # c_1996_May <dbl>, h_1996_Jun <dbl>, c_1996_Jun <dbl>, h_1996_Jul <dbl>,
## # c_1996_Jul <dbl>, h_1996_Aug <dbl>, c_1996_Aug <dbl>, h_1996_Sep <dbl>,
## # c_1996_Sep <dbl>, h_1996_Oct <dbl>, c_1996_Oct <dbl>, h_1996_Nov <dbl>,
## # c_1996_Nov <dbl>, h_1996_Dec <dbl>, c_1996_Dec <dbl>, h_1997_Jan <dbl>,
## # c_1997_Jan <dbl>, h_1997_Feb <dbl>, c_1997_Feb <dbl>, h_1997_Mar <dbl>,
## # c_1997_Mar <dbl>, h_1997_Apr <dbl>, c_1997_Apr <dbl>, h_1997_May <dbl>, …
We have variables named h_1996_Jan and c_1996_Feb, where
h indicates if the data are number of herdsc indicates if the data are number of casespivot_longer()bovine_long <- bovine_wide |>
pivot_longer(
cols = !country, # which columns are we pivoting?
names_to = c("type", "year", "month"), # variables to create in long form
names_sep = "_", # split names at _
values_to = "n" # variable to hold the cell values
)
bovine_long
## # A tibble: 2,034 × 5
## country type year month n
## <chr> <chr> <chr> <chr> <dbl>
## 1 England h 1996 Jan 91024
## 2 England c 1996 Jan 355
## 3 England h 1996 Feb 90735
## 4 England c 1996 Feb 392
## 5 England h 1996 Mar 90512
## 6 England c 1996 Mar 440
## 7 England h 1996 Apr 90210
## 8 England c 1996 Apr 454
## 9 England h 1996 May 89357
## 10 England c 1996 May 434
## # ℹ 2,024 more rowspivot_longer()Conceptually this is not too different from what we already saw
Instead of the columns names pivoting into a single column, they pivot to multiple columns
pivot_wider()Having now separated the data into a long format we should see that there are multiple variables in the type column
The observations are at the level of country, by year and month
But the data for each observation are spread over two rows
pivot_wider()We can use pivot_wider() to pivot the data to a wider format to create n_herds and n_cases columns
bovine_long |>
pivot_wider(
id_cols = c(country, year, month), # which columns are we *not* pivoting
names_from = type, # column with names of new variables
values_from = n # column with values for new variables
) |>
rename(n_herds = h, n_cases = c) # rename for nicer names
## # A tibble: 1,017 × 5
## country year month n_herds n_cases
## <chr> <chr> <chr> <dbl> <dbl>
## 1 England 1996 Jan 91024 355
## 2 England 1996 Feb 90735 392
## 3 England 1996 Mar 90512 440
## 4 England 1996 Apr 90210 454
## 5 England 1996 May 89357 434
## 6 England 1996 Jun 88554 403
## 7 England 1996 Jul 87802 387
## 8 England 1996 Aug 87356 379
## 9 England 1996 Sep 86879 313
## 10 England 1996 Oct 86065 274
## # ℹ 1,007 more rowspivot_wider()To see how pivot_wider() works, we’ll revisit the small dog blood pressure data set, but this time we have only two dogs
We want to create a data frame with column names taken from "measurement" and fill the cells with the values from "value"
pivot_wider()We want to create a data frame with columns names taken from "measurement" and fill the cells with the values from "value"
pivot_wider() works — IFirst, pivot_wider() has to identify what the new columns will be
pivot_wider() works — IIBy default, the rows in the output are determined from those variables not going to new variables (names) or values
These are the id_cols and should uniquely identify a row in the new (wider) data frame
Here we only have 1 ID column
pivot_wider() works — IIIpivot_wider() then uses these combinations to create an empty data frame
Where bp1 etc were the distinct values of measurement
pivot_wider() works — IVpivot_wider() then fills in the new data frame with the relevant values
Not every dog had three BP measurements, so there is no value to fill in the data frame with
In that case the value is NA (missing)